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About me 


Vladislav Pyatkov 

I work at GridGain 

Our product is based on Apache Ignite 

Today I am an Apache Ignite committer 

I have been developing distributed databases for 
more than five years 
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Node 


One server in the distributed database 
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Cluster 


A set of nodes, also called the topology 
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Data 


The rule that maps entries to nodes is called the affinity function 


Table 1 


Table 2 
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Partition 


The result of the affinity function calculation is called the distribution 


Table 1 


Table 2 
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Partition internals 


Each partition contains a set of entries 


Table 1 


Table 2 
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Partition 


Copies of one partition on different nodes are called replicas 


Table 1 
partitions = 3 
replicas = 2 


Table 2 
partitions = 3 
replicas = 3 
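
As a rough illustration of the terms above, the following Python sketch computes a distribution for the two example tables. The hash-and-round-robin scheme and the names `partition_of` and `distribution` are assumptions made for this sketch; the slides do not say which affinity function is actually used.

```python
import zlib


def partition_of(key: str, partitions: int) -> int:
    """Map an entry key to a partition with a stable hash (illustrative only)."""
    return zlib.crc32(key.encode("utf-8")) % partitions


def distribution(nodes: list, partitions: int, replicas: int) -> dict:
    """Assign each partition to `replicas` distinct nodes, round-robin."""
    if replicas > len(nodes):
        raise ValueError("not enough nodes for the requested replica count")
    return {
        p: [nodes[(p + r) % len(nodes)] for r in range(replicas)]
        for p in range(partitions)
    }


nodes = ["node1", "node2", "node3"]
table1 = distribution(nodes, partitions=3, replicas=2)  # Table 1 from the slide
table2 = distribution(nodes, partitions=3, replicas=3)  # Table 2 from the slide
```

With three nodes, every partition of Table 2 ends up on all of them, while each partition of Table 1 lives on two of the three.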
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Table of content 


• Data movement processes 
  o Rebalancing 
  o Replication 
• Replication in a stable distribution 
  o Rebalancing does not happen. 
  o Replication happens. 
• Changing partition distribution 
  o Topology changes due to adding and removing nodes. 
  o Replication factor changes. 
  o Other scenarios when partition distribution is changed. 
• What happens if all replicas should be moved to new nodes during redistribution? 
  o RAFT consensus 
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Data movement processes 
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Rebalancing 


The process of moving data between nodes to make the distribution as close 
to uniform as possible 


[Figure: data placement on the nodes before and after rebalancing] 
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Replication 


A process that ensures consistency across partition replicas 


replicate 
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Two approaches to replication 


1. Data entries are transferred between replicas. 
2. Log records are transferred between replicas. 


For both cases, we assume that the data (the partition) is 
replicated on three nodes. 
The set of nodes is called the partition's replication group. 
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Two approaches to replication 


[Figure: replicas 1, 2, and 3 exchanging data entries from data entries storage (first approach) and log records from log records storage (second approach)] 

Replication in a stable distribution 
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Partition lag is handled by replication 


Up-to-date replica        Lagging replica 
6 => insert(4) 
5 => update(2) 
4 => delete(3) 
3 => insert(3)            3 => insert(3) 
2 => insert(2)            2 => insert(2) 
1 => insert(1)            1 => insert(1) 

The missing records are sent to the lagging replica: 
{6, insert(4); 5, update(2); 4, delete(3)} 

These updates will be applied serially to the storage. 

After replication, both logs contain records 1 to 6. 
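
The catch-up shown above can be sketched in Python. This is a minimal model, not the actual implementation: the `Replica` class, the `catch_up` function, and the choice to store the index of the last write as the value are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    log: list = field(default_factory=list)      # (index, operation, key) records
    storage: dict = field(default_factory=dict)  # key -> index of the last write

    def last_index(self) -> int:
        return self.log[-1][0] if self.log else 0

    def append(self, index: int, op: str, key: int) -> None:
        # Records must arrive in order so that the storage is updated serially.
        if index != self.last_index() + 1:
            raise ValueError(f"expected index {self.last_index() + 1}, got {index}")
        self.log.append((index, op, key))
        if op == "delete":
            self.storage.pop(key, None)
        else:  # insert or update
            self.storage[key] = index


def catch_up(source: Replica, target: Replica) -> None:
    """Send the target every record it is missing, oldest first."""
    missing = [rec for rec in source.log if rec[0] > target.last_index()]
    for index, op, key in missing:
        target.append(index, op, key)


records = [(1, "insert", 1), (2, "insert", 2), (3, "insert", 3),
           (4, "delete", 3), (5, "update", 2), (6, "insert", 4)]

up_to_date, lagging = Replica(), Replica()
for rec in records:
    up_to_date.append(*rec)
for rec in records[:3]:      # the lagging replica has only records 1 to 3
    lagging.append(*rec)

catch_up(up_to_date, lagging)
```

Because the records are applied strictly in index order, the lagging replica ends with the same log and the same storage contents as the up-to-date one.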
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Truncated log 


Up-to-date replica log (earlier records are truncated): 
50 => update(1) 
49 => update(1) 

Lagging replica log (id = 3): 
3 => update(1) 
2 => update(1) 
1 => insert(1) 

The records the lagging replica needs are no longer in the log, so a snapshot is transferred instead: 
1. Clear the storage before applying a snapshot. 
2. Write all data from the snapshot to the storage. 

Although the logs are different, the persistent states are 
the same (id = 50). 
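
A small Python sketch of the snapshot path follows. The function names `needs_snapshot` and `install_snapshot`, the idea of tracking a first retained log index, and the stored values are hypothetical; they only mirror the two steps named on the slides.

```python
def needs_snapshot(first_retained_index: int, replica_last_index: int) -> bool:
    """True when the next record the replica needs was already truncated."""
    return replica_last_index + 1 < first_retained_index


def install_snapshot(storage: dict, snapshot: dict, snapshot_index: int) -> int:
    """Replace the replica's state with the snapshot and return its new index."""
    storage.clear()            # clear the storage before applying a snapshot
    storage.update(snapshot)   # write all data from the snapshot to the storage
    return snapshot_index


# Hypothetical values: key 1 was last written by record 50 on the source.
source_storage = {1: "value written by record 50"}
lagging_storage = {1: "value written by record 3"}
lagging_index = 3

# The source log starts at record 49, but the lagging replica needs record 4.
if needs_snapshot(first_retained_index=49, replica_last_index=lagging_index):
    lagging_index = install_snapshot(lagging_storage, dict(source_storage), 50)
```

After the snapshot is installed the lagging replica holds none of records 4 to 48 in its log, yet its storage matches the source.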
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Changing partition distribution 
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Removing a node 


The replication factor is 2. 

The replication factor is temporarily violated. Hence, the factor needs to be 
recovered. 
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Removing a node: moving to a new distribution 


The replication factor is 2. 

Again, each partition contains two replicas. 
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Removing a node: choose one partition 


The replication group for partition <P> is changed from {1, 3} (old distribution) 
to {1, 2} (new distribution). 
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Removing a node: changing in replication groups 


The replication group for partition <P> is changed from {1, 3} (old distribution) 
to {1, 2} (new distribution). 

[Figure: nodes 1, 2, and 3 in the old distribution; nodes 1 and 2 in the new distribution] 
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Adding nodes 


Data is distributed across both cluster nodes. 

The third node space is empty. Hence, the space should be adjusted. 
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Adding nodes: moving to a new distribution 


Data is distributed across both cluster nodes. 

Data is distributed across three cluster nodes as evenly as possible. 
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Adding nodes: choose one partition 


The replication group for partition <P> is changed from {1, 2} (old distribution) 
to {1, 3} (new distribution). 

[Figure: nodes 1, 2, and 3] 
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Adding nodes: changing in replication groups 


The replication group for partition <P> is changed from {1, 2} (old distribution) 
to {1, 3} (new distribution). 

[Figure: nodes 1 and 2 in the old distribution; nodes 1, 2, and 3 in the new distribution] 
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Adding nodes: changing in replication groups 


The replication group for partition <P> is changed from {1, 2} (old distribution) 
to {1, 3} (new distribution). 

[Figure: nodes 1 and 2 in the old distribution; nodes 1, 2, and 3 in the new distribution. Callout: "There is no replica of partition <P>."] 
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Redistribution steps 


1. Recalculate the distribution for the new topology. 

2. Determine the partitions whose replicas have to be moved. 

3. Switch the replication groups of those partitions to an intermediate 
distribution. 

4. Copy the replicas from one node to another. 

5. After the replicas are moved, switch the cluster to the new distribution. 
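
The steps above can be sketched as a small planning function. The name `plan_redistribution` and the dictionary layout are hypothetical, and treating the intermediate distribution as the union of the old and new replication groups is an assumption; the slides do not define it precisely.

```python
def plan_redistribution(old: dict, new: dict) -> dict:
    """Compare two distributions and describe what has to move per partition."""
    plan = {}
    for partition, new_group in new.items():
        old_group = old.get(partition, set())
        if new_group == old_group:
            continue  # this partition keeps its replicas where they are
        plan[partition] = {
            # Assumption: while data is copied, both groups hold the partition.
            "intermediate": old_group | new_group,
            "copy_to": new_group - old_group,
            "drop_from": old_group - new_group,
        }
    return plan


# Node 3 joins: partition P moves from {1, 2} to {1, 3}, partition Q stays put.
old = {"P": {1, 2}, "Q": {1, 2}}
new = {"P": {1, 3}, "Q": {1, 2}}
plan = plan_redistribution(old, new)
```

Only partition P appears in the plan: its replica has to be copied to node 3 and can be dropped from node 2 once the cluster switches to the new distribution.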
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Modifying the distribution 


The cluster intends to change the distribution from {1, 2, 3} to a new one, {4, 5}. 
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Modifying the distribution 


The cluster has two different distributions, the previous distribution {1, 2, 3} and 
the new one {4, 5} 
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Modifying the distribution 


The new distribution {4, 5} fully replaces the previous one 
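
One way to keep a partition safe while two distributions coexist is the joint consensus rule from the Raft design, where a decision needs a majority of the old group and a majority of the new group. The sketch below illustrates that rule only; whether the system described in this talk uses joint consensus or another membership change scheme is not stated on the slides, and the function names are hypothetical.

```python
def has_majority(group: set, acks: set) -> bool:
    """True when more than half of `group` acknowledged."""
    return len(group & acks) > len(group) // 2


def joint_commit(old_group: set, new_group: set, acks: set) -> bool:
    """During the transition, both groups must agree independently."""
    return has_majority(old_group, acks) and has_majority(new_group, acks)


old_group, new_group = {1, 2, 3}, {4, 5}

both_agree = joint_commit(old_group, new_group, acks={1, 2, 4, 5})
only_new = joint_commit(old_group, new_group, acks={4, 5})
new_incomplete = joint_commit(old_group, new_group, acks={1, 2, 3, 4})
```

Here only the first case commits. The new group alone is not enough during the transition, and a two-node group needs both of its members to form a majority.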
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RAFT 


• RAFT is a consensus algorithm based on log replication. 
• The protocol processes writes on one node (the leader) and reads from any node. 
• Log replication works by transferring log records. 
• A snapshot transfer is also supported by the algorithm. 
• The algorithm allows changing the replication group (quorum). 
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RAFT consensus 


Makes the protocol reliable in case of split brain 
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RAFT consensus 


Makes the protocol reliable in case of split brain 


Each of the three nodes contains a partition replica 
(the nodes are a replication group). 
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RAFT consensus 


Makes the protocol reliable in case of split brain. 


The connection with node 1 is broken. 
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RAFT consensus 


Makes the protocol reliable in case of split brain 


This part has a quorum and so continues to process operations. 

This part can't process operations. 
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RAFT consensus 


Makes the protocol reliable but has limitations. 


Two nodes contain the same 
partition replica. 
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RAFT consensus 


Makes the protocol reliable but has limitations 


A similar network issue happens. 
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RAFT consensus 


Makes the protocol reliable but has limitations 


Neither part can process operations. 

The cluster is unavailable, but the data is not lost. 
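
The two split-brain scenarios come down to a majority check. A minimal sketch, assuming a quorum means strictly more than half of the replication group (the function name `has_quorum` is illustrative):

```python
def has_quorum(group_size: int, reachable: int) -> bool:
    """A part of the group can make progress only with a strict majority."""
    return reachable > group_size // 2


# Three replicas, node 1 is cut off from nodes 2 and 3.
three_node_split = {
    "node 1 alone": has_quorum(3, 1),
    "nodes 2 and 3": has_quorum(3, 2),
}

# Two replicas, the nodes are cut off from each other.
two_node_split = {
    "node 1 alone": has_quorum(2, 1),
    "node 2 alone": has_quorum(2, 1),
}
```

With three replicas one side always keeps a majority, so the partition stays available. With two replicas neither side does, which is the limitation shown on the slides.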


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Conclusion 


• Both replication and rebalancing are required for data movement in any 
distributed database. 
• Using log records for replication allows replication to be used for 
rebalancing as well. 
• Replication does not care about concurrent data updates. 
• Using a log for replication is better than using snapshots for replication. 
• RAFT is a reliable algorithm for copying logs, but it has limitations. 
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Leave your feedback! 


You can rate the talk and give feedback on what you've liked or what could 
be improved. 
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